Data Analysis with R - Final Project

by Samit Chaudhuri

Chemical Analysis of Red and White Wines

This project explores relationship between quality and chemical properties of wines. Wine quality is based on subjective ratings by human experts. Chemical properties are determined by objective measurements. Two tidy data sets of 1599 red wines and 4898 white wines, each with 11 input variables and 1 output variables are used.

Additionally some models are presented to predict the quality of wines from their chemical characteristics.

Load data and necessary packages

setwd("~/personal/repos/github/analyze-wines/")
redWines <- read.csv("wineQualityReds.csv")
redWines <- subset(redWines, select = -c(X))
str(redWines)
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
library(ggplot2)
library(GGally)
## Warning: replacing previous import by 'grid::arrow' when loading 'GGally'
## Warning: replacing previous import by 'grid::unit' when loading 'GGally'
library(scales)
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
## 
## Attaching package: 'memisc'
## 
## The following object is masked from 'package:scales':
## 
##     percent
## 
## The following objects are masked from 'package:stats':
## 
##     contr.sum, contr.treatment, contrasts
## 
## The following objects are masked from 'package:base':
## 
##     as.array, trimws
library(gridExtra)
qualityLevel = function(quality) {
  level = as.numeric(cut(quality, breaks=c(0, 4, 5, 6, 10)))
  level
  return(level)
}
rpalette = 5
wpalette = 1
ptfont = 10           # font size of plot title
atfont = 9            # font size of axis title
ltfont = 8            # font size of legend title

The red wine data has one dependent variable, quality, that gives each wine a score between 0 and 10. There are 11 independent variables, some of with may be correlated.

Let us first find the distribution wine quality in the sample data set. Then we use the basic histogram to categorize quality scores into 4 levels: poor, good, fine, and excellent. This new factor variable, quality.label, is computed by cutting the quality number into 4 ranges (0,3], (0,5], (5,6], (6,10]. Now we can add some color to the previous histogram.

ptfont = 10           # font size of plot title
atfont = 9            # font size of axis title
ltfont = 8            # font size of legend title
rqhist1 <- ggplot(data=redWines, aes(x=quality)) +
    ggtitle("Red Wine Quality") +     # Plot title text
    xlab("Wine Quality") + ylab("Count") +  # Axis label text
    theme_bw() +                                          # Theme elements
    theme(title=element_text(size=ptfont, face="bold"),   # Title font 
         axis.title=element_text(size=atfont, face="bold")) +
    scale_x_continuous(limits=c(3,9), breaks = seq(3, 9, 1)) + 
    geom_histogram(binwidth = 1)
redWines$quality.label = cut(redWines$quality, breaks=c(0, 4, 5, 6, 10), 
    labels=c("poor", "average", "fine", "excellent")) 
rqhist2 <- ggplot(data=redWines, aes(x=quality, fill=quality.label)) +
    ggtitle("Red Wine Categories") +     # Plot title text
    xlab("Wine Quality") + ylab("Count") +  # Axis label text
    scale_fill_brewer(type = 'div', palette = rpalette, # Legend title/color
         guide=guide_legend(reverse=TRUE, title="Quality Rating")) +
    theme_bw() +                                          # Theme elements
    theme(title=element_text(size=ptfont, face="bold"),   # Title font 
        axis.title=element_text(size=atfont, face="bold"),    # Axis font 
        legend.title=element_blank(),  
        legend.position=c(.9,.7)) +                       # Legend position
    scale_x_continuous(limits=c(3,9)) + 
    geom_histogram(binwidth = 1)
grid.arrange(rqhist1, rqhist2, ncol=2)
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 8 rows containing missing values (geom_bar).

So we have a decent categorization of excellent, fine, good, and poor red wines.

In the beginning, we don’t really know what causes a wine to be excellent or poor. So we want the entire data set speak for itself by plotting many variables all at once. Hopefully the plots will draw our attention to a smaller set of variables and relationships.

#ggpairs(redWines, params = c(color=I('red3'), shape = I('.'), corSize = 8, outlier.shape = I('.')), title = "Red Wines Scatter Plot Matrix") + theme(axis.title.x=element_text(size=8,angle=45, vjust=1), axis.title.y=element_text(size=8,angle=45, hjust=1), axis.text=element_blank())
#ggpairs(redWines, params = c(color=I('red3'), shape = I('.'), corSize = 8, outlier.shape = I('.')), title = "Red Wines Scatter Plot Matrix") + theme(size=8, axis.text=element_blank())
ggpairs(redWines, params = c(color=I('red3')), title = "Red Wines Scatter Plot Matrix") + theme(axis.text=element_blank())
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`colour`)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Some of the dependent variables are highly correlated. For example, citric acid is correlated with both fixed and volatile acidity. Density is correlated with both fixed acidity and alcohol content.

The quality score is most correlated with alcohol content and volatile acidity, but it does not correlate much with anything else. On the other hand, the box plots of the quality label (excellent, fine, good, poor) show a dependency on fix and volatile acidity; excellent wines have higher mean fixed acidity and lower mean volatile acidity.

We will closely examine the effects of alcohol and acidity (both fixed and volatile) on quality score and quality label with grouped density plots.

ptfont = 10           # font size of plot title
atfont = 9            # font size of axis title
ltfont = 8            # font size of legend title
alDensity <- ggplot(data=redWines, aes(x=alcohol, color=quality.label)) +
      ggtitle("Quality vs Alcohol") + # Plot title text
      xlab("Alcohol Content") + ylab("Count") +  
      scale_color_brewer(type = 'div', palette = rpalette, # Legend title/color
          guide=guide_legend(reverse=TRUE)) +
      theme_bw() +                                          # Theme elements
      theme(title=element_text(size=ptfont, face="bold"),   # Title font 
           axis.title=element_text(size=atfont, face="bold"),   # Axis font 
           legend.title=element_blank(), 
           legend.position=c(.85,.6)) +
      geom_density() 
acidDensity <- ggplot(data=redWines) +
      ylab("Count") +  
      scale_color_brewer(type = 'div', palette = rpalette, # Legend title/color
          guide=guide_legend(reverse=TRUE)) +
      theme_bw() +                                          # Theme elements
      theme(title=element_text(size=ptfont, face="bold"),   # Title font 
           axis.title=element_text(face="bold"),   # Axis font 
           legend.title=element_blank(), 
           legend.position=c(.85,.6))
vaDensity <- acidDensity + 
    geom_density(aes(x=volatile.acidity, color=quality.label)) + 
    ggtitle("Quality vs Volatile Acidity") +  # Plot title 
    xlab("Volatile Acidity") 
faDensity <- acidDensity + 
    geom_density(aes(x=fixed.acidity, color = quality.label)) + 
    ggtitle("Quality vs Fixed Acidity") +     # Plot title 
    xlab("Fixed Acidity") 
grid.arrange(alDensity, vaDensity, faDensity, nrow=2, ncol=2)

The density plots of different quality levels are shifted with respect to each other. The alcohol density plots shift toward right with improving quality. Better wines tend of have higher alcohol. The fixed acidity plots also shift toward right with improving quality, while volatile acidity density plots shift toward left with improving quality. Better wines tend of have higher fixed acidity and lower volatile acidity.

These trends can also be observed with box plots.

ptfont = 10           # font size of plot title
atfont = 9            # font size of axis title
ltfont = 8            # font size of legend title
albox <- ggplot(data=redWines, aes(x=quality.label, y=alcohol,
       color=quality.label)) +
      ggtitle("Alcohol vs Quality") +     # Plot title text
      xlab("Qality Label") + ylab("Alcohol Stats") +  # Axis label text
      scale_color_brewer(type = 'div', palette = rpalette) +# Legend title/color
      theme_bw() +                                          # Theme elements
      theme(title=element_text(size=ptfont, face="bold"),   # Title font 
           axis.title=element_text(size=atfont, face="bold"),   # Axis font 
           legend.title=element_text(size=ltfont, face="bold"), # Legend font 
           legend.position="none") +                      # Legend position
      geom_boxplot() 
abox <- ggplot(data=redWines, aes(x=quality.label, color=quality.label)) +
           # Plot title text
      xlab("Qality Label") +   # Axis label text
      scale_color_brewer(type = 'div', palette = rpalette)+ # Legend color
      theme_bw() +                                          # Theme elements
      theme(title=element_text(size=ptfont, face="bold"),   # Title font 
           axis.title=element_text(size=atfont, face="bold"),   # Axis font 
           legend.title=element_text(size=ltfont, face="bold"), # Legend font 
           legend.position="none")                        # Legend position
par(mfrow=c(1,2))
vabox <- abox + geom_boxplot(aes(y=volatile.acidity)) + 
     ggtitle("Volatile Acidity vs Quality") +
     ylab("Volatile Acidity Stats")
fabox <- abox + geom_boxplot(aes(y=fixed.acidity)) +
     ggtitle("Fixed Acidity vs Quality") +
     ylab("Fixed Acidity Stats")
grid.arrange(albox, vabox, fabox, nrow=2, ncol=2)

Now let’s observe the joint effect of volatile acidity and alcohol content on quality with a scatter plot. Side-by-side, we also plot the joint effect of fixed acidity and alcohol content.

alascatter <- ggplot(data=redWines, aes(x=alcohol, 
      group=quality.label, fill=quality.label, color=quality.label)) +
      scale_x_continuous(limits=c(8, 15), breaks = seq(8, 15, .5)) +
      xlab("Alcohol Percentage") + # Axis label text
      scale_color_brewer(type = 'div', palette = rpalette, 
          guide=guide_legend(reverse=TRUE)) +              # Legend color
      scale_fill_brewer(guide = FALSE) +
      theme_bw() +                                         # Theme elements
      theme(title=element_text(size=ptfont, face="bold"),  # Title font 
           axis.title=element_text(size=atfont, face="bold"),   # Axis font 
           legend.title=element_blank(),                   # Legend font 
           legend.position=c(.85,.6))                      # Legend position
alvascatter <- alascatter + 
      ggtitle("Volatile Acidity and Alcohol Content of Red Wines") + 
      ylab("Volatile Acidity") +  
      scale_y_continuous(limits=c(0.1, 1.6), breaks = seq(0.1, 1.6, .1)) +
      geom_point(aes(y=volatile.acidity, size=3))
alfascatter <- alascatter + 
      ggtitle("Fixed Acidity and Alcohol Content of Red Wines") + 
      ylab("Fixed Acidity") +  
      scale_y_continuous(limits=c(3, 16), breaks = seq(3, 16, 1)) +
      geom_point(aes(y=fixed.acidity, size=3))
alvascatter

alfascatter

Red wine quality is definitely good when alcohol content is above 12 and volatile acidity is below 0.5. Even when alcohol content is not quite as high, one can find good wines as with low volatile acidity, or with high fixed acidity. Similarly when volatile acidity is not quite as low, or fixed acidity is not quite as high, one can find good wines with high alcohol content. In other words, as long has either alcohol content is above 12, or volatile acidity is below 8 or fixed acidity is above 8, there is a good chance that the wine has good quality.

Linear Regression Model of Red Wine Quality

First we start with a full linear regression model.

fulllm <- lm(data=redWines, quality ~ . - quality.label)
summary(fulllm)
## 
## Call:
## lm(formula = quality ~ . - quality.label, data = redWines)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68911 -0.36652 -0.04699  0.45202  2.02498 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.197e+01  2.119e+01   1.036   0.3002    
## fixed.acidity         2.499e-02  2.595e-02   0.963   0.3357    
## volatile.acidity     -1.084e+00  1.211e-01  -8.948  < 2e-16 ***
## citric.acid          -1.826e-01  1.472e-01  -1.240   0.2150    
## residual.sugar        1.633e-02  1.500e-02   1.089   0.2765    
## chlorides            -1.874e+00  4.193e-01  -4.470 8.37e-06 ***
## free.sulfur.dioxide   4.361e-03  2.171e-03   2.009   0.0447 *  
## total.sulfur.dioxide -3.265e-03  7.287e-04  -4.480 8.00e-06 ***
## density              -1.788e+01  2.163e+01  -0.827   0.4086    
## pH                   -4.137e-01  1.916e-01  -2.159   0.0310 *  
## sulphates             9.163e-01  1.143e-01   8.014 2.13e-15 ***
## alcohol               2.762e-01  2.648e-02  10.429  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.648 on 1587 degrees of freedom
## Multiple R-squared:  0.3606, Adjusted R-squared:  0.3561 
## F-statistic: 81.35 on 11 and 1587 DF,  p-value: < 2.2e-16

Now we select only the signficant variables, and then leave out collinear variables, citric.acid, pH, and density because they are highly correlated with acidity.

redWines$quality = as.numeric(redWines$quality)
reducedlm <- lm(data=redWines, quality ~ . 
    - citric.acid - density - pH - quality.label)
summary(reducedlm)
## 
## Call:
## lm(formula = quality ~ . - citric.acid - density - pH - quality.label, 
##     data = redWines)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7772 -0.3728 -0.0605  0.4520  2.0213 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.7099270  0.2332016  11.621  < 2e-16 ***
## fixed.acidity         0.0251103  0.0101305   2.479   0.0133 *  
## volatile.acidity     -1.0698952  0.1000046 -10.698  < 2e-16 ***
## residual.sugar        0.0072856  0.0120570   0.604   0.5458    
## chlorides            -1.7451754  0.3922647  -4.449 9.23e-06 ***
## free.sulfur.dioxide   0.0041719  0.0021259   1.962   0.0499 *  
## total.sulfur.dioxide -0.0031126  0.0006882  -4.523 6.56e-06 ***
## sulphates             0.8832980  0.1110258   7.956 3.35e-15 ***
## alcohol               0.2794893  0.0167556  16.680  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6499 on 1590 degrees of freedom
## Multiple R-squared:  0.3556, Adjusted R-squared:  0.3524 
## F-statistic: 109.7 on 8 and 1590 DF,  p-value: < 2.2e-16

The R-squared of the reduced model is not much different from the full model. The R-squared did not really improve with models built with other combinations of feature variables.

Prediction with Classification

One possible reason that linear regression model of red-wine quality has low accuracy is that quality has a non-linear relation with other dependent variables. This prompts us to explore the relationship with a multi-class classification trees. Before building a CART model, we will split the data into a training set and a test set. CART model built on the training set will be used to predict quality on the test set. Then we will be able to measure the accuracy of the predictive capability of the model.

library(rpart)
library(rpart.plot)
library(caret)
library(e1071)

redWines$quality = factor(redWines$quality)

# Split data into a train and test set
index <- 1:nrow(redWines)
set.seed(144)
testSamples <- sample(index, trunc(length(index)/3))
rqindex <- which(colnames(redWines)=="quality")
rqlindex <- which(colnames(redWines)=="quality.label")
redTest <- redWines[testSamples,-rqlindex]
redTrain <- redWines[-testSamples,-rqlindex]

# Create rpart model with cross validation
rtr.control = trainControl(method = "cv", number = 5)
cp.grid = expand.grid( .cp = (0:4)*0.001) 
set.seed(144)
rrpart.grid = train(data = redTrain, quality ~ ., method = "rpart", 
    trControl = rtr.control, tuneGrid = cp.grid)
rrpart.model = rrpart.grid$finalModel
prp(rrpart.model, tweak=1.5)

#Compute confusion matrix 
rrpart.cpred <- predict(rrpart.model, newdata=redTest, type = "class" )
rrpart.cmatrix <- table(pred = rrpart.cpred, true = redTest[,rqindex])
rrpart.accuracy <- sum(diag(rrpart.cmatrix))/sum(rrpart.cmatrix)
rrpart.cmatrix
##     true
## pred   3   4   5   6   7   8
##    3   0   0   0   0   0   0
##    4   0   0   0   0   0   0
##    5   1  15 166  62   4   0
##    6   1   7  61 119  38   2
##    7   0   1  10  21  22   3
##    8   0   0   0   0   0   0
rrpart.accuracy
## [1] 0.575985

Indeed, the CART model can predict quality on the test set with 58% accuracy.

Chemical Analysis of White Wines

We now perform EDA on white wine quality and see if they follow a similar pattern as red wines.

whiteWines <- read.csv("wineQualityWhites.csv")
whiteWines <- subset(whiteWines, select = -c(X))
str(whiteWines)
## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
whiteWines$quality.label = cut(whiteWines$quality, breaks=c(0, 4, 5, 6, 10), 
    labels=c("poor", "average", "fine", "excellent"))
wqhist1 <- ggplot(data=whiteWines, aes(x=quality)) +
      ggtitle("White Wine Quality") +     # Plot title text
      xlab("Wine Quality") + ylab("Count") +  # Axis label text
      theme_bw() +                                          # Theme elements
      theme(title=element_text(size=ptfont, face="bold"),   # Title font 
           axis.title=element_text(size=atfont, face="bold")) +
      scale_x_continuous(limits=c(3,9), breaks = seq(3, 9, 1)) +  
      geom_histogram(binwidth = 1)
wqhist2 <- ggplot(data=whiteWines, aes(x=quality, fill=quality.label)) +
          ggtitle("White Wine Categories") +     # Plot title text
          xlab("Wine Quality") + ylab("Count") +  # Axis label text
          scale_fill_brewer(type = 'div', palette = wpalette, 
              guide=guide_legend(reverse=TRUE, title="Quality Rating")) +
          theme_bw() +                                          # Theme elements
          theme(title=element_text(size=ptfont, face="bold"),   # Title font 
             axis.title=element_text(size=atfont, face="bold"),   # Axis font 
             legend.title=element_text(size=ltfont, face="bold"), # Legend font 
             legend.position=c(.9,.7)) +                      # Legend position
          scale_x_continuous(limits=c(3,9)) + 
          geom_histogram(binwidth = 1)
grid.arrange(wqhist1, wqhist2, ncol=2)
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 8 rows containing missing values (geom_bar).

Now plot all vriable pairs in the data set to identify noticable relationships.

#ggpairs(whiteWines, 
#       params = c(color=I('green3'), shape = I('.'), corSize=8, 
#            outlier.shape = I('.')),
#       title = "White Wines Scatter Plot Matrix")

Similar to quality or red wines, quality of white wines also relate to alcohol content and acidity. We can examine these relations with scatterplots.

alascatter <- ggplot(data=whiteWines, aes(x=alcohol, 
      group=quality.label, fill=quality.label, color=quality.label)) +
      scale_x_continuous(limits=c(8, 15), breaks = seq(8, 15, .5)) +
      xlab("Alcohol Percentage") +  # Axis label text
      scale_color_brewer(type = 'div', palette = wpalette,  # Legend color
          guide=guide_legend(reverse=TRUE, title="Quality Rating")) +
      scale_fill_brewer(guide = FALSE) +
      theme_bw() +                                          # Theme elements
      theme(title=element_text(size=ptfont, face="bold"),   # Title font 
           axis.title=element_text(size=atfont, face="bold"),   # Axis font 
           legend.title=element_text(size=ltfont, face="bold"), # Legend font 
           legend.position=c(.85, .6))                    # Legend position
alvascatter <- alascatter +
      ylab("Volatile Acidity") +
      scale_y_continuous(limits=c(0.08, 1.1), breaks = seq(0.08, 1.1, .1)) +
      ggtitle("Volatile Acidity and Alcohol Content of White Wines") + 
      geom_point(aes(y=volatile.acidity), size=3)
alfascatter <- alascatter +
      ylab("Fixed Acidity") +
      scale_y_continuous(limits=c(3, 16), breaks = seq(3, 16, 1)) +
      ggtitle("Fixed Acidity and Alcohol Content of White Wines") + 
      geom_point(aes(y=fixed.acidity), size=3)
alvascatter

alfascatter

We can also create a classification tree for prediction. The model can predict quality on the test set with 52% accuracy. The CART is drawn below.

library(rpart)
library(rpart.plot)
library(caret)
library(e1071)
 
whiteWines$quality = factor(whiteWines$quality)

# Split data into a train and test set
index <- 1:nrow(whiteWines)
set.seed(144)
wtestSamples <- sample(index, trunc(length(index)/3))
wqindex <- which(colnames(whiteWines)=="quality")
wqlindex <- which(colnames(whiteWines)=="quality.label")
whiteTest <- whiteWines[wtestSamples,-wqlindex]
whiteTrain <- whiteWines[-wtestSamples,-wqlindex]

# Create rpart model with cross validation
wtr.control = trainControl(method = "cv", number = 5)
wcp.grid = expand.grid( .cp = (0:4)*0.001) 
set.seed(144)
wrpart.grid = train(data = whiteTrain, quality ~ . - citric.acid - density 
    - pH , method = "rpart", trControl = wtr.control, tuneGrid = cp.grid)


wrpart.model = wrpart.grid$finalModel
prp(wrpart.model)

#Compute confusion matrix 
wrpart.cpred <- predict(wrpart.model, whiteTest, type = "class" )
wrpart.cmatrix <- table(pred = wrpart.cpred, true = whiteTest[,wqindex])
wrpart.accuracy <- sum(diag(wrpart.cmatrix))/sum(wrpart.cmatrix)
wrpart.cmatrix
##     true
## pred   3   4   5   6   7   8   9
##    3   0   0   0   0   0   0   0
##    4   0   0   1   1   0   0   0
##    5   2  34 272 154  15   0   0
##    6   5  25 199 542 245  49   0
##    7   1   2   5  27  44   9   0
##    8   0   0   0   0   0   0   0
##    9   0   0   0   0   0   0   0
wrpart.accuracy
## [1] 0.5257353

```

Interpretation of Model

More alcohol and higher quality usually go together. In addition, volatile acidity degrades quality, probably because it induces a vinegary taste in the wine. Residual sugar levels are somewhat important probably because sweetness is often associated with freshness. Sulphates are important too, possibly because of its link to fermentation.

Final Plots and Summary

This report explored the relationship betwen chemical properties of red and white wines, and their quality as identified by wine experts. Several plots including histograms, density plots, box plots, and scatter plots were used in the exploration to identify three major featurs impacting wine quality. They are alcohol content, fixed acidity, and volatile acidity.

This section reproduces a few of those plots. For example, the density plots of red wine quality over the three features are shown below. Detailed code for generating these plots have already been included in earlier sections.

grid.arrange(alDensity, vaDensity, faDensity, nrow=2, ncol=2)

For white wines, the relationsip between quality, alcohol content and volatile acidity can be seen in the following scatter plot.

alvascatter

We also built regression and classification models to predict wine quality from its chemical properties. The classification trees identified two additional significant features, namely residual sugar and sulphates. In the following we show the classification-tree for white wines. Detailed code for generating the model have already been included in an earlier section.

prp(wrpart.model)

Reflection

In this report, we have explored the relationship betwen chemical properties of wines, such as fixed and volatile acidity, residual sugar, sulphates, alcohol content, and the quality of the wines as identified by the experts. We started with 11 different independent input variables, and showed that one can pick fine or excellent wines by focusing on 5 or fewer primary contributors. However, our prdictive models are not accurate enough (58% for red and 52 % for white). Despite providing interesting insights, they need to more refined before providing reliable wine ratings. In addition to linear regression and CART models used in this project, there exists other advanced techniques for extracting features and building and training models that can be used to create a better predictive model for wine quality.